Istio causing problems with dispatching
# spicedb
w
Hello! I've been running SpiceDB with dispatch disabled for a while now. I've tried re-enabling it, and hit the same problem as I did in the past: it's far worse than having it disabled! See these screenshots (dispatch enabled at 11:57, rolled back at 12:07):
- Latency goes through the roof, at all percentiles. E.g. the p50 is ~6x worse, the p99 ~12x worse (ignoring the massive spike that happened at 12:00, which is a problem in itself).
- Availability goes way down. I'm seeing logs (3rd and 4th screenshots) that are probably related. I'm not getting that many of these logs (~500 in the period where dispatch was enabled).
- Cache hit rate doesn't improve at all, it's still ~30%. I was expecting the dispatch cluster to mean better caching?

I see 3 possible scenarios here:
- Dispatch just doesn't work
- Dispatch doesn't work for my particular workload
- I'm doing something wrong

Note: it's nothing new, dispatch has never worked for me since at least 1.13.0. What do you think, any pointers?
https://cdn.discordapp.com/attachments/844600078948630559/1238452244186402876/image.png?ex=663f5608&is=663e0488&hm=b31ecbc585fbc9baa34f83f4fb5e786d1cbe51680e1a0ffaa1bdf96651a8b00f&
https://cdn.discordapp.com/attachments/844600078948630559/1238452244454834186/image.png?ex=663f5608&is=663e0488&hm=ef5c32a3a999affada4bacaf91279e2936d4864a60b456376d84fd00af42566d&
https://cdn.discordapp.com/attachments/844600078948630559/1238452244769411092/image.png?ex=663f5608&is=663e0488&hm=5cf3fc8ca626f8c68c257da2e8a2b958edfa90e25de6694efb5994f19a4d5b64&
https://cdn.discordapp.com/attachments/844600078948630559/1238452245100499056/image.png?ex=663f5608&is=663e0488&hm=27bf486675cb6f0c39a30f60f2feefa5ea205ce3b82f465b660050d9566b6b48&
https://cdn.discordapp.com/attachments/844600078948630559/1238452245411135488/image.png?ex=663f5608&is=663e0488&hm=5ea2fd135cac219ee24895ff6f4ea29be6915a1d99358f364973875542a5761a&
v
We've never run without dispatch, and all our managed offerings run with dispatch enabled, without errors or latency impact when the clusters roll. That does not look right to me. I'd probably use OpenTelemetry to trace the requests and see where the time is being spent. I'd suspect something is up with your Kubernetes setup / networking. What does your setup look like? Are you using Istio? Sidecars?
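For example, here's a minimal client-side tracing sketch in Go using the OpenTelemetry SDK; the collector endpoint and `doCheck` are placeholders (assumptions, not your actual setup) for your OTLP collector and your real SpiceDB call. SpiceDB itself can also export spans (check `spicedb serve --help` for its `--otel-provider` / `--otel-endpoint` flags):

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Ship spans to an OTLP collector; "otel-collector:4317" is a placeholder.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// Wrap each SpiceDB call in a span so you can see where the time goes
	// (client -> sidecar -> SpiceDB -> dispatch hop).
	tracer := otel.Tracer("spicedb-client")
	ctx, span := tracer.Start(ctx, "CheckPermission")
	doCheck(ctx) // placeholder for your actual CheckPermission call
	span.End()
}

// doCheck stands in for a real SpiceDB request.
func doCheck(ctx context.Context) { time.Sleep(10 * time.Millisecond) }
```

With client spans and SpiceDB's own spans in the same backend, you can see whether the extra latency lives in the sidecar hop or inside dispatch itself.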
w
We do use Istio (via sidecars). It's not my domain of expertise, but I've indeed learned to look at Istio with suspicion every time something goes wrong
Lemme see if I can get OTel working again
For my understanding, what sort of cache hit rate do you see? Is it highly dependent on data/workload or does it tend to be pretty consistent?
v
We've had reports from folks having issues with Istio, and removing it fixed them. Istio does not make a lot of sense here, since dispatch is an internal API to SpiceDB: each SpiceDB version knows how to talk to its peers, and it's not meant to be used like a public API. Unless you have a specific networking setup, I don't see the value in running it in front of dispatch.
My guess is that it's related to Istio; it adds overhead in the request path. I'd suggest trying without it.
Cache hit ratio is highly dependent on your workload. And again, we want to stress that SpiceDB's caching mechanism is hot-spot caching: it's a mechanism to dedupe requests
20-30% is in line with the sort of cache hit rates we see.
w
The Istio thing makes sense. I'll ask my infra team for help with trying to remove it
v
Is your setup perhaps crossing Kubernetes-cluster boundaries with Istio? Like cross-datacenter?
w
It's all in the same cluster. Pods could be in different AZs though
v
That's OK, networking overhead between AZs should be in the realm of sub-millisecond latency in most cloud providers
w
Given I'm already at a ~30% cache hit rate, what benefits should I expect from a working dispatch cluster?
v
Horizontal scalability and better usage of your database
You are basically looking at reducing your db usage by up to a factor of your node count in the worst case (3x with your 3 pods)
w
It's very much what I'm after, but I'm not really understanding how, if we don't expect the cache hit rate to go up 🤔
v
Without dispatch, your requests are load balanced across 3 independent caches. If you have to solve subproblem A, that subproblem has to be solved by each SpiceDB node separately, which triples the number of times it needs to access the database.
The caches are both client-side and server-side, at both ends of the dispatch ring. So effectively you end up with caches populated the same way either way; the difference is the amount of work you had to do to populate them.
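Roughly the idea, as a hypothetical Go sketch; SpiceDB actually uses a consistent hashring plus caches at both ends of the hop, so this only shows the shape of the routing:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// owner picks the node responsible for a subproblem key. Hypothetical
// simplification: SpiceDB uses a consistent hashring, but the effect is
// the same: every node agrees on a single owner per key.
func owner(key string, nodes []string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return nodes[h.Sum32()%uint32(len(nodes))]
}

func main() {
	nodes := []string{"spicedb-0", "spicedb-1", "spicedb-2"}

	// Without dispatch, whichever node happens to receive a subproblem
	// computes it, so all three caches pay to learn the same answer.
	// With dispatch, every node forwards it to the owner, so it is
	// computed once and then cached on both sides of that hop.
	for _, sub := range []string{"doc:1#viewer", "doc:2#viewer", "group:eng#member"} {
		fmt.Printf("%s -> %s\n", sub, owner(sub, nodes))
	}
}
```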
w
I'll have to ponder over that but sounds reasonable! Thank you
j
it also allows singleflight to truly work
rather than, if you're running three pods, it becoming "3-flight"
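here's a rough sketch of what singleflight does within a single process, using Go's golang.org/x/sync/singleflight (that SpiceDB uses this exact package internally is an assumption). Dispatch is what extends the dedup across pods, by routing a given key to one pod instead of three:

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

func main() {
	var g singleflight.Group
	var wg sync.WaitGroup
	calls := 0

	// Ten concurrent requests for the same subproblem collapse into one
	// execution; the other nine callers just wait and share its result.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Do("subproblem-A", func() (interface{}, error) {
				calls++ // safe: executions for a given key never overlap
				time.Sleep(50 * time.Millisecond) // pretend database work
				return "allowed", nil
			})
		}()
	}
	wg.Wait()
	fmt.Println("underlying executions:", calls) // typically 1, not 10
}
```

with three pods and no dispatch, each pod runs its own flight for the same key, hence "3-flight"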
y
IIRC Istio also behaves as an internal load balancer, and that could be causing problems
j
yeah, we've had multiple reports of problems with dispatching and sidecars like istio
w
We've disabled Istio, seems like the problem is indeed gone 👍 Cache hit rate went up to ~50% and database CPU load is reduced by almost half 🙌 Thank you for the pointer!
j
perfect
v
nice!